Method and apparatus for operating an array of storage devices.

A storage controller operates an array of parity protected
data storage units as a RAID level 5. One of the storage units
is a dedicated write assist unit. The assist unit is a temporary
storage area for data to be written to the other units. When
the array controller receives data from a host, it first writes
the data to the assist unit. Because the assist unit is not
parity protected and is only temporary storage, it is possible
to write data to the assist unit sequentially, without first
reading the data, greatly reducing response time. The array
controller signals the CPU that the data has been written to
storage as soon as it has been written to the assist unit.
Parity in the array is updated asynchronously. In the event of
system or storage unit failure, data can be recovered using the
remaining storage units and/or the assist unit. The write
assist unit also doubles as a spare unit. Data recovered from a
failed unit can be stored on the write assist unit, which then
ceases to function as a write assist unit and assumes the
function of the failed storage unit.

The present invention relates to computer data
storage apparatus, and in particular to arrays of direct
access storage devices commonly known as
"RAIDs".

The extensive data storage needs of modern 
computer systems require large capacity mass data 
storage devices. A common storage device is the 
magnetic disk drive, a complex piece of machinery 
containing many parts which are susceptible to fail- 
ure. A typical computer system will contain several 
such units. The failure of a single storage unit can be 
a very disruptive event for the system. Many systems 
are unable to operate until the defective unit is re- 
paired or replaced, and the lost data restored. 

As computer systems have become larger, faster, 
and more reliable, there has been a corresponding increase
in need for storage capacity, speed and reliability
of the storage devices. Simply adding storage
units to increase storage capacity causes a corre- 
sponding increase in the probability that any one unit 
will fail. On the other hand, increasing the size of ex- 
isting units, absent any other improvements, tends to 
reduce speed and does nothing to improve reliability. 

Recently there has been considerable interest in 
arrays of direct access storage devices, configured to 
provide some level of data redundancy. Such arrays 
are commonly known as "RAIDs" (Redundant Array of 
Inexpensive Disks). Various types of RAIDs providing 
different forms of redundancy are described in a paper
entitled "A Case for Redundant Arrays of Inexpensive
Disks (RAID)", by Patterson, Gibson and Katz,
presented at the ACM SIGMOD Conference, June, 
1988. Patterson, et al., classify five types of RAIDs 
designated levels 1 through 5. The Patterson nomen- 
clature has become standard in the industry. The un- 
derlying theory of RAIDs is that a large number of rel- 
atively small disk drives, some of which are redun- 
dant, can simultaneously provide increased capacity, 
speed and reliability. 

Using the Patterson nomenclature, RAID levels 3 
through 5 (RAID-3, RAID-4, RAID-5) employ parity 
records for data redundancy. Parity records are 
formed from the Exclusive-OR of all data records
stored at a particular location on different storage
units in the array. In other words, in an array of N storage
units, each bit in a block of data at a particular location
on a storage unit is Exclusive-ORed with every
other bit at that location in a group of (N-1) storage
units to produce a block of parity bits; the parity block
is then stored at the same location on the remaining
storage unit. If any storage unit in the array fails, the
data contained at any location on the failing unit can 
be regenerated by taking the Exclusive-OR of the 
data blocks at the same location on the remaining de- 
vices and their corresponding parity block. 
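
The parity relationship described above can be illustrated with a minimal sketch. The helper name `xor_blocks` and the two-byte sample blocks are assumptions chosen for illustration only; they do not appear in Patterson et al. or in Clark et al.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise Exclusive-OR of equally sized blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks stored at the same location on three units
# (N = 4 units once the parity block is counted).
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
parity = xor_blocks(*data)

# Suppose the unit holding data[1] fails: regenerate its block from the
# surviving data blocks and the corresponding parity block.
rebuilt = xor_blocks(data[0], data[2], parity)
```

Because Exclusive-OR is its own inverse, the same routine serves both to generate parity and to regenerate a lost block.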

RAID-4 and RAID-5 are further characterized by
independently operating read/write actuators in the
storage units. In other words, each read/write head of
a disk drive unit is free to access data anywhere on
the disk, without regard to where other units in the array
are accessing data. US-A-No. 4,761,785 to Clark
et al., which is hereby incorporated by reference, describes
a type of independent read/write array in
which the parity blocks are distributed substantially
equally among the storage units in the array. Distributing
the parity blocks shares the burden of updating
parity among the disks in the array on a more or less
equal basis, thus avoiding potential performance bottlenecks
that may arise when all parity records are
maintained on a single dedicated disk drive unit. Patterson
et al. have designated the Clark array RAID-5.
RAID-5 is the most advanced level RAID described
by Patterson, offering improved performance over
other parity protected RAIDs.

One of the problems encountered with parity protected
disk arrays having independent read/writes
(i.e., RAID-4 or RAID-5) is the overhead associated
with updating the parity block whenever a data block
is written. Typically, as described in Clark, et al., the
data block to be written is first read and the old data
Exclusive-ORed with the new data to produce a
change mask. The parity block is then read and Exclusive-ORed
with the change mask to produce the
new parity data. The data and parity blocks can then
be written. Thus, two read and two write operations
are required each time data is updated.
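
The change mask arithmetic can be checked with a short sketch. The function names and sample values are hypothetical; the sketch shows only that parity updated through a change mask equals parity recomputed from all data blocks.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise Exclusive-OR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    """Return (data to write, parity to write) for a single-block update."""
    change_mask = xor(old_data, new_data)      # bits that differ
    new_parity = xor(old_parity, change_mask)  # flip the same bits in parity
    return new_data, new_parity


# A stripe with two data blocks; only the first is rewritten.
other = b"\x33\x55"
old = b"\x0f\xf0"
old_parity = xor(old, other)
written, new_parity = small_write(old, old_parity, b"\xff\x00")
```

The block that is not rewritten (`other`) is never read, which is why the update costs two reads and two writes regardless of the number of units in the array.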

In a typical computer system, the central processing
unit (CPU) operates much faster than the storage
devices. The completion of the two read and two
write operations by the storage devices which are
necessary for updating data and parity requires a comparatively
long period of time in relation to CPU operations.
If the CPU holds off further processing of a
task until the data update in the storage devices is
completed, system performance can be adversely affected.
It is desirable to permit the CPU to proceed
with processing a task immediately or shortly after
transmitting data to the disk array for writing, while
still maintaining data redundancy.

A single parity block of a RAID-3, RAID-4 or
RAID-5 provides only one level of data redundancy.
This ensures that data can be recovered in the event
of failure of a single storage unit. However, the system
must be designed to either discontinue operations in
the event of a single storage unit failure, or continue
operations without data redundancy. If the system is
designed to continue operations, and a second unit
fails before the first unit is repaired or replaced and
its data reconstructed, catastrophic data loss may occur.
In order to support a system that remains operational
at all times, and reduces the possibility of such
catastrophic data loss, it is possible to provide additional
standby storage units, known as "hot spares".
Such units are physically connected to the system,
but do not operate until a unit fails. In that event, the
data on the failing unit is reconstructed and placed on



EP 0 569 313 A2 



the hot spare, and the hot spare assumes the role of
the failing unit. Although the hot spares technique enables
a system to remain operational and maintain
data redundancy in the event of a device failure, it requires
additional storage units (and attendant cost)
which otherwise serve no useful function.

It is therefore an object of the present invention
to provide an enhanced method and apparatus for
storing data in a computer system.

Another object of this invention is to provide an
enhanced method and apparatus for managing a redundant
array of storage devices in a computer system.

Another object of this invention is to increase the
performance of a computer system having a redundant
array of storage devices.

Another object of this invention is to provide an enhanced
method and apparatus whereby a computer
system having a redundant array of storage devices
may continue to operate if one of the storage units
fails.

Another object of this invention is to reduce the cost
of providing increased performance and data redundancy
in a computer system having a redundant array
of storage devices.

An array storage controller services a plurality of
data storage units in an array. A storage management
mechanism resident on the controller maintains parity
records on the storage units it services. Data and parity
blocks are preferably organized as described in
the patent to Clark et al. (RAID-5). The array controller
contains a RAM cache for temporarily storing update
data, read data, and change masks for parity
generation.

One of the storage units in the array is a dedicated
write assist unit. The assist unit is a temporary storage
area for data to be written to other units in the array.
When the array controller receives data to be
written to storage, it first writes the data to the assist
unit. Because the assist unit is not parity protected, it
is not necessary to first read the data on the assist
unit. Furthermore, because the unit is only temporary
storage, it is possible to write data to the assist unit
sequentially, greatly reducing seek and latency times.

The array controller signals the CPU that the
data has been written to storage as soon as it has
been written to the assist unit. It is still necessary to
perform two read and two write operations to update
the data, as described in Clark, et al. However, these
operations can proceed asynchronously with further
processing of the task in the CPU.

The storage management mechanism maintains
status information in the array controller's memory
concerning the current status of data being updated.
The amount of memory required for such status information
is relatively small, much smaller than the data
itself. This status information, together with the write
assist unit, provides data redundancy at all times. In
the event of a failure of the assist unit, the array controller
continues to update data from the contents of
its RAM as if nothing had happened. In the event of
a failure of a storage unit in the array other than the
assist unit, the data on that unit can be reconstructed
using the remaining units in the array (including the
assist unit) and the status information. Finally, in the
event of failure of the controller itself, the storage
units (including the assist unit) contain information
needed for complete recovery.

The write assist unit also doubles as a spare unit 
in the event of failure of another unit in the array. After 
any incomplete write operations are completed and 
parity updated, the data in the failed storage unit is re- 
constructed by Exclusive-ORing all the other units, 
and this data is stored on the assist unit. The assist 
unit then ceases to function as an assist unit, and 
functions as the failed storage unit that it replaced. 
The system then continues to operate normally, but 
without a write assist unit. The only effect is that data 
updates cause a greater impact to system perfor- 
mance, but data is otherwise fully protected. 

Fig. 1 is a block diagram of a system incorporating
the components of the preferred embodiment of
this invention;
Fig. 2 is a diagram of the major components of a
disk array controller according to the preferred
embodiment;
Figs. 3A and 3B are a flow diagram showing the
steps involved in performing a fast write task according
to the preferred embodiment;
Fig. 4 is a flow diagram showing the steps involved
in performing a service unit write task according
to the preferred embodiment;
Fig. 5 is a graphical representation of a test to determine
whether a WRITE command should be
written to the write assist unit according to the
preferred embodiment;
Fig. 6 shows the structure of a data record written
to the write assist unit according to the preferred
embodiment;
Fig. 7 shows the structure of a header/trailer
block within a data record written to the write assist
unit, according to the preferred embodiment;
Fig. 8 is a high-level flow diagram showing the
steps taken by the array controller in the event of
failure of one of the service disk units, according
to the preferred embodiment;
Fig. 9 shows the steps required to complete any
incomplete write operations in the event of failure
of one of the service disk units, according to the
preferred embodiment;
Fig. 10 shows the steps required to obtain the
most recent uncommitted list from write assist
disk unit during data recovery, according to the
preferred embodiment;
Fig. 11 shows the steps required to complete all
incomplete WRITE operations identified on an
uncommitted list recovered from the write assist
unit, according to the preferred embodiment.

A block diagram of the major components of computer
system 100 of the preferred embodiment of the
present invention is shown in Fig. 1. A host system
101 communicates over a high-speed data bus 102
with a disk array controller 103. Controller 103 controls
the operation of storage units 104-108. In the
preferred embodiment, units 104-108 are rotating
magnetic disk drive storage units. While five storage
units are shown in Fig. 1, it should be understood that
the actual number of units attached to controller 103
is variable. It should also be understood that more
than one controller 103 may be attached to host 101.
While host 101 is depicted in Fig. 1 as a monolithic entity,
it will be understood by those skilled in the art
that host 101 typically comprises many elements,
such as a central processing unit (CPU), main memory,
internal communications busses, and I/O devices
including other storage devices. In the preferred embodiment,
computer system 100 is an IBM AS/400
computer system, although other computer systems
could be used.

Disk unit 104 is a write assist disk unit. The remaining
units 105-108 are designated service units.
The write assist unit 104 is a temporary storage area
for data to be written to the service units 105-108. For
fast access, data is written sequentially to assist unit
104. The storage area of each service unit 105-108
is logically divided into blocks 111-118. In the preferred
embodiment, disk units 104-108 are physically
identical units (except for the data stored thereon)
having identical storage capacity, and blocks 111-118
are the same size. While it would be possible to employ
this invention in configurations of varying sized
storage units or varying sized blocks, the preferred
embodiment simplifies the control mechanism.

The set of all blocks located at the same location
on the several service units constitutes a stripe. In Fig.
1, storage blocks 111-114 constitute a first stripe, and
blocks 115-118 constitute a second stripe. At least
one of the blocks in each stripe is dedicated to data
redundancy, and contains parity or some form of error
correcting code. In the preferred embodiment, data
redundancy takes the form of a single parity block in
each stripe. Parity blocks 111, 116 are shown designated
"P" in Fig. 1. The remaining blocks 112-115,
117-118 are data storage blocks for storing data.
The parity block for the stripe consisting of blocks
111-114 is block 111. The parity block contains the Exclusive-OR
of data in the remaining blocks on the
same stripe.

In the preferred embodiment, parity blocks are
distributed across the different service disk units in a
round robin manner, as shown in Fig. 1. Because with
every write operation the system must not only update
the block containing the data written to, but also
the parity block for the same stripe, parity blocks are
usually modified more frequently than data blocks.
Distributing parity blocks among different service
units will in most cases improve performance by distributing
the access workload. However, such distribution
is not necessary to practicing this invention, and
in an alternate embodiment it would be possible to
place all parity blocks on a single disk unit.
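
A round robin placement of parity blocks can be sketched as follows. The description fixes only the first two stripes of Fig. 1 (parity in blocks 111 and 116, i.e., on the first and second service units); the modulo rule, the zero-based numbering, and the function names are assumptions used to illustrate one rotation consistent with those two stripes.

```python
def parity_unit(stripe: int, n_service_units: int) -> int:
    """Index of the service unit holding parity for a stripe (assumed rule)."""
    return stripe % n_service_units


def layout(n_stripes: int, n_service_units: int):
    """Mark each block of each stripe as parity ("P") or data ("D")."""
    return [
        ["P" if unit == parity_unit(stripe, n_service_units) else "D"
         for unit in range(n_service_units)]
        for stripe in range(n_stripes)
    ]


# Two stripes across four service units, as in Fig. 1.
fig1 = layout(2, 4)
```

Placing all parity blocks on a single unit, as in the alternate embodiment, would correspond to `parity_unit` returning a constant.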

The allocation of storage area on the service
units into stripes as described above, each containing
blocks of data and a parity block, is the same as that
described in US-A-No. 4,761,785 to Clark, et al.,
which is incorporated by reference.

Array controller 103 is shown in greater detail in 
Fig. 2. Controller 103 comprises programmable proc- 
essor 201, random access memory (RAM) 202, bus 
interface circuitry 205, and disk unit interface circuitry
206, which communicate with each other via various 
internal communication paths as shown. Bus inter- 
face circuitry 205 sends and receives communica- 
tions with host 101 via high speed bus 102. Disk unit 
interface circuitry 206 sends and receives communications
with disk units 104-108. Programmable processor
201 controls the operation of array controller
103 by executing a storage management control pro- 
gram 210 resident in memory 202. Controller 103 in- 
cludes means for performing Exclusive-OR opera- 
tions on data which are required for maintaining parity 
and data recovery, as described below. Exclusive- 
OR operations may be performed by processor 201, 
or by special purpose hardware (not shown). 

Memory 202 comprises dynamic RAM portion 
203 and non-volatile RAM portion 204. Non-volatile
RAM 204 is RAM which maintains its data in the ab- 
sence of system power. The contents of dynamic 
RAM 203 are lost when the system loses power. Dynamic
RAM circuits using currently available technology
are considerably less expensive and/or have
shorter access time than non-volatile RAM. Hence, it 
is desirable to use dynamic RAM for storage of all but 
the most critical data. In the preferred embodiment, 
a portion of control program 210 necessary for initial- 
ization of the array controller 103 is stored in non-vol- 
atile RAM 204; the remaining part of control program 
210 is loaded from host 101 when the system is initially
powered-up, and stored in dynamic RAM 203, as
shown in Fig. 2. 

Memory 202 contains several records which sup- 
port operation of the write assist unit in accordance 
with the preferred embodiment. Uncommitted list 212 
in dynamic RAM 203 is a list representing those 
WRITE operations which may be incomplete. In par- 
ticular, after array controller 103 receives a WRITE 
command from host 101, writes the data to write as- 
sist unit 104, and signals the host that the operation 
is complete, there will typically be some time delay 
before the data is actually written to the service units 
105-108 and parity updated. Uncommitted list 212 records
those operations which may be in such a pending
status. If a device failure should occur before the
data can be written to the service units and parity updated,
uncommitted list 212 will be used for recovery,
as described more fully below. In the preferred embodiment,
uncommitted list 212 is a variable length
list of addresses on assist unit 104 at which the respective
incomplete WRITE operations have been
stored.

Non-volatile RAM 204 contains status record 211.
Status information includes an address of a recent uncommitted
write operation on assist unit 104, which is
used to reconstruct data in the event of loss of the
contents of dynamic RAM 203, and the current status
of each disk unit 104-108 in the array (i.e., whether
the unit is on-line and functioning, and whether it is
configured as an assist unit or a service unit). Memory
202 may include other records not shown.

In addition to control program 210 and the records
described above, dynamic RAM 203 is used as
a cache for temporary storage of data being read from 
or written to storage units 104-108. 

The operation of computer system 100 in conjunction
with the hardware and software features
necessary to the present invention will now be described.
To host 101, the array controller 103 and attached
disk units 104-108 appear as a single storage entity.
Host 101 issues READ and WRITE commands to
array controller 103, requesting that it respectively
read data from, or write data to, the disk units. Host
101 receives read data or a completion message
when the respective operation is complete. Host 101
is unaware of the mechanics of updating parity and
other disk maintenance performed by controller 103.

In normal operation, write assist disk unit 104 is
only written to, and not used during the READ operation.
Controller 103 executes a READ operation by
accepting a READ command from host 101, and determining
whether the data requested exists in the
controller's dynamic RAM 203. If so, the data in RAM
203 is sent directly to the host. Otherwise, data is first
read from the appropriate storage unit into dynamic
RAM 203, and from there transferred to the host system.
Depending on the size of dynamic RAM 203,
data may be saved there awaiting a WRITE operation
for the same data. If the original version of data to be
updated is already in RAM 203 when the WRITE operation
is processed, it will not be necessary to read
the data again in order to update parity, thus improving
system performance. In some applications, the
host may be able to indicate to the controller which
data read is likely to be modified.
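
The READ path can be sketched as a simple cache lookup. The class name `ReadPath`, the dictionary model of the service units, and the `disk_reads` counter are assumptions made for illustration; the sketch omits cache eviction and any host hints.

```python
class ReadPath:
    """Toy model of READ handling through dynamic RAM 203 (hypothetical)."""

    def __init__(self, service_units: dict):
        self.service_units = service_units  # {(unit, block): bytes}
        self.cache = {}                     # stands in for dynamic RAM 203
        self.disk_reads = 0

    def read(self, unit: int, block: int) -> bytes:
        key = (unit, block)
        if key not in self.cache:
            # Not in RAM: read from the service unit into RAM first.
            self.cache[key] = self.service_units[key]
            self.disk_reads += 1
        # Data is always transferred to the host from RAM.
        return self.cache[key]


path = ReadPath({(105, 0): b"abc"})
first = path.read(105, 0)
second = path.read(105, 0)  # served from RAM, no second disk access
```

A block left in the cache by a READ is also available as the "old data" when a later WRITE to the same block must update parity.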

A WRITE operation is performed by two asynchronous
tasks which are part of control program 210
running in the array controller's processor 201. A first
task (the fast write task, shown in Figs. 3A and 3B)
manages the write assist disk unit 104 and decides
when to tell host 101 that the operation is complete.
A second task (the service unit write task, shown in
Fig. 4) performs the writing of data and updating of
parity to the service disk units 105-108.

The WRITE operation in the array controller is
triggered by receipt of a WRITE command from the
host at step 301. The WRITE command is placed on
a write service queue in memory 202 at step 302. The
service unit write task will retrieve the command from
the queue and process it in due course. The fast write
task continues down the branch starting at step 303
in Fig. 3A.

The fast write task begins by checking status record
211 to determine whether the write assist function
is active at step 303. This function may be deactivated
if one of the service disks 105-108 has failed,
and data on this service disk has been reconstructed
on write assist disk 104, as described below. If the
write assist function has been deactivated, the fast
write task simply waits at step 305 for the service unit
write task to complete. If the write assist function is active,
the fast write task proceeds to analyze the command.

In the preferred embodiment, the write assist
disk (WAD) unit 104 is not used for all WRITE operations.
The fast write task first makes a determination
whether assist unit 104 should be used for caching
the WRITE data at step 304, as described more fully
below. Analysis of performance of the storage subsystem
of the present invention has shown that the
greatest performance improvement is obtained from
caching small WRITE operations, and that the relative
performance improvement declines as the
amount of data to be written becomes larger. Eventually,
the data to be written can become sufficiently
large that use of the write assist unit causes no improvement,
or an actual decline in performance.

There are several reasons for this. The use of the
write assist unit always entails additional work for the
storage subsystem, because the amount of work required
to update the service units remains unchanged.
This additional overhead burden must be
justified by the performance advantage gained by an
early signalling that the operation is complete. The
assist unit reduces seek and latency times by operating
sequentially. For small WRITE operations, the response
time attributable to seek and latency is relatively
greater than for large WRITE operations, hence
the performance improvement attributable to the assist
unit is relatively greater. Additionally, where a
large WRITE operation is writing data to two or more
blocks on the same stripe of the service units, it is
possible to omit or combine certain steps required to
update the parity block (as described more fully below),
so that fewer than two reads and two writes are
required per block of data written. Finally, because
there is only one write assist unit in the preferred embodiment,
and a plurality of service units, it is possible
for a backlog to develop in the assist unit.

Ideally, the determination whether to use the assist
unit at step 304 is based on two considerations:
the resources available for the operation, and an estimate
of the time required to complete the write to the
assist unit (as opposed to time required to complete
the write to the service units). In the preferred embodiment,
the assist unit will be used for a WRITE operation
if all the following criteria are met:

(a) The number of data blocks in the WRITE command
under consideration is less than Threshold
#1, where Threshold #1 represents some limit on
the size of buffers or other resources available to
handle the WRITE command;

(b) The number of data blocks in the WRITE commands
on the WAD queue is less than Threshold
#2, this number being roughly proportional to the
time to begin any new command added to the
WAD queue; and

(c) The number of data blocks in the WRITE command
plus the number of data blocks on the WAD
queue is less than Threshold #3, this sum being
roughly proportional to the time required to complete
the write of the command under consideration
to the assist unit, where Threshold #3 could
represent either a limit on WAD queue resources
or a maximum time allowed for completing a command.

This test is shown graphically in Fig. 5. The axes
501, 502 represent the number of blocks in the WRITE
command under consideration and the number of
blocks currently in the WAD queue, respectively. The
shaded area 503 represents a determination that the
assist unit should be used.
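
Criteria (a) through (c) amount to a three-way comparison, sketched below. The function name and the numeric threshold values are hypothetical; the description does not state actual values for Thresholds #1 through #3.

```python
def use_assist_unit(cmd_blocks: int, queued_blocks: int,
                    t1: int, t2: int, t3: int) -> bool:
    """True when a WRITE command falls inside shaded area 503 of Fig. 5."""
    return (
        cmd_blocks < t1                      # (a) command small enough
        and queued_blocks < t2               # (b) WAD queue not too long
        and cmd_blocks + queued_blocks < t3  # (c) estimated completion time
    )


# Illustrative thresholds only (assumed values).
T1, T2, T3 = 8, 32, 36
small_cmd = use_assist_unit(4, 10, T1, T2, T3)
large_cmd = use_assist_unit(8, 0, T1, T2, T3)
long_queue = use_assist_unit(4, 32, T1, T2, T3)
too_slow = use_assist_unit(7, 30, T1, T2, T3)
```

With these assumed values, criterion (c) cuts the corner off the rectangle defined by (a) and (b): a 7-block command behind a 30-block queue passes the first two tests but fails the third.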

If controller 103 determines at step 304 that the
WRITE operation does not meet the criteria for use of
the write assist unit, the fast write task simply waits
at step 305 for the service unit write task to complete.
When the service unit task completes, the first task
then sends a command complete message to host
101, acknowledging that the WRITE operation has
completed, at step 311.

If controller 103 determines at step 304 that the
WRITE operation meets the criteria for use of the
write assist unit, the WRITE command is placed on a
write assist disk queue at step 306 awaiting service
by the assist unit 104. The fast write task then waits
at steps 307-308 for either the service unit task to
complete or the WRITE command in the write assist
disk queue to reach a point of no return (i.e., to reach
a point where the write assist unit 104 is ready to receive
the data). If the service unit task completes first
("write to array done" at step 307), the write command
is removed from the write assist disk queue at step
310, and a command complete message is sent to
host 101 at step 311.

If the WRITE command on the write assist disk
queue reaches the point of no return before the service
unit task completes (step 308), the data is written
to write assist unit 104 at step 312. The steps required
to complete this part of the operation are shown in
Fig. 3B. The WRITE command is first added to uncommitted
list 212 in dynamic RAM 203 at step 321.
Backup copies of the uncommitted list also exist in
write assist unit 104, as more fully described below.
The controller then builds a header and trailer onto
the write data, and sends this data to write assist unit
104, at step 322. The fast write task then waits at
steps 323, 324 until either the write task to the service
units completes or the data sent to the write assist
unit is physically written to the assist unit. If the service
unit write task completes first (step 323), controller
103 sends a command complete message to host
101 (step 325), and removes the WRITE command
from the uncommitted list (step 328). If the writing of
data to the write assist unit completes first (step 324),
the controller sends the command complete message
to host 101 at step 326. The fast write task then
waits for the service unit task to complete at step 327.
After the service unit task has completed, the WRITE
command is removed from the uncommitted list at
step 328.

In typical operation, WRITE commands will be
processed by following a path represented by blocks
301, 302, 303, 304, 306, 307, 308, 321, 322, 323, 324, 326,
327, 328. In following this path, it will be observed that
the command complete message is sent to the host
(step 326) before the actual writing of data to the service
units completes (step 327). Thus, the host is free
to continue processing as if data contained in the
WRITE command had actually been physically written
to the storage units and parity updated, although in
fact this has not necessarily been done.
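
The branches of the fast write task can be traced with a small sketch that returns the step numbers visited. This is a reading of the text describing Figs. 3A and 3B, not a reproduction of the figures themselves; the function name and its boolean parameters are hypothetical, and the asynchronous waits are modelled simply as flags saying which event happened first.

```python
def fast_write_steps(assist_active: bool = True,
                     meets_criteria: bool = True,
                     array_done_before_ready: bool = False,
                     array_done_before_assist_write: bool = False) -> list:
    """Step numbers followed by the fast write task for one WRITE command."""
    steps = [301, 302, 303]
    if not assist_active:
        return steps + [305, 311]   # wait for service units, then acknowledge
    steps.append(304)
    if not meets_criteria:
        return steps + [305, 311]
    steps += [306, 307]
    if array_done_before_ready:
        return steps + [310, 311]   # dequeue from WAD queue, acknowledge
    steps += [308, 321, 322, 323]
    if array_done_before_assist_write:
        return steps + [325, 328]
    return steps + [324, 326, 327, 328]


typical = fast_write_steps()
```

On the typical path the acknowledgement to the host (step 326) precedes the wait for the service unit task (step 327), which is the source of the response time improvement.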

The second asynchronous task (service unit
write task) writes data from dynamic RAM 203 to a
service disk unit and updates parity. A flow diagram
of this task is shown in Fig. 4. It selects a WRITE operation
from among those queued in memory 202 at
step 401. The selection criteria are not a part of this
invention, and could be, e.g., FIFO, shortest
seek/latency, or some other criteria based on system
performance and other considerations. When the
WRITE operation is performed, parity must be updated.
By taking the Exclusive-OR of the new write data
with the old data, it is possible to obtain a bit map of
those bits being changed by the WRITE operation.
Exclusive-ORing this bit map with the existing parity
data will produce the updated parity data. Therefore,
before writing to storage, the task first checks whether
the old data exists in dynamic RAM 203 in unmodified
form at step 402. If not, it must be read into RAM
203 from the data block on the service disk unit on
which it is stored at step 403. This old data in RAM 203
is then Exclusive-ORed with the new data in RAM
203 to produce the bit map of changed data at step
404. The bit map is saved temporarily in RAM 203
while the new data is written to the same data block
on the appropriate service disk unit at step 405. The
old parity data is then read into RAM 203 (if not already
there) from the corresponding parity block in
the same stripe of blocks at steps 406, 407, and Exclusive-ORed
with the bit map to produce the new
parity data at step 408. This new parity data is written
back to the same parity block on the disk unit at step
409, completing the second task. An appropriate message
or interrupt is passed to the first task when the
second task completes.
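
Steps 402 through 409 can be sketched with the disk and dynamic RAM 203 modelled as dictionaries. The function name, the key names, and the returned I/O counts are assumptions added for illustration; queue selection at step 401 and the completion message are omitted.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise Exclusive-OR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def service_unit_write(disk: dict, ram: dict, data_key, parity_key,
                       new_data: bytes):
    """Single-block write with parity update; returns (disk reads, disk writes)."""
    reads = writes = 0
    if data_key not in ram:                   # step 402: old data in RAM?
        ram[data_key] = disk[data_key]        # step 403: read old data
        reads += 1
    bit_map = xor(ram[data_key], new_data)    # step 404: changed bits
    disk[data_key] = ram[data_key] = new_data  # step 405: write new data
    writes += 1
    if parity_key not in ram:                 # step 406: old parity in RAM?
        ram[parity_key] = disk[parity_key]    # step 407: read old parity
        reads += 1
    new_parity = xor(ram[parity_key], bit_map)      # step 408
    disk[parity_key] = ram[parity_key] = new_parity  # step 409
    writes += 1
    return reads, writes


disk = {"d0": b"\x0f", "d1": b"\x33", "p": b"\x3c"}
ram = {"d0": b"\x0f"}  # old data left in RAM by an earlier READ
counts = service_unit_write(disk, ram, "d0", "p", b"\xff")
```

In this example the old data is already in RAM, so only one disk read (the parity block) is needed instead of two.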

The steps shown in Fig. 4 are typical of a small
write operation, specifically, a write operation
involving data stored on a single block of a service disk.
Where a large write operation involves multiple blocks
within the same stripe, it is possible to omit or
combine certain steps to achieve a performance
improvement. For example, where two blocks on a single
stripe are being written to, the controller would
typically (1) read data in a first block, (2) Exclusive-OR
the data read with the new data to be written to
produce a change mask, (3) write new data to the first
block, (4) read data in a second block, (5)
Exclusive-OR the data read with the change mask from
the first block to update the change mask, (6)
Exclusive-OR the change mask with the data to be
written to the second block to again update the change
mask, (7) write new data to the second block, (8) read
the parity block, (9) Exclusive-OR the parity block with
the change mask to produce the new parity, and (10)
write the new parity. Note that in this case, although
two separate blocks were updated, only three writes
and three reads were required. In the case where
most or all blocks within a stripe are being written to,
it is more efficient to access all blocks rather than
read before each write. In this case, the controller will
first read each block not being updated, accumulating
a parity by Exclusive-ORing, and then write each
block being updated, again accumulating the parity
by successive Exclusive-ORing. After the last write of
data, the accumulated parity is written to the parity
block. For these reasons, the use of the write assist
disk unit is less attractive for large WRITE operations.
Accordingly, in the preferred embodiment the
controller makes an initial determination at step 303
whether the WRITE operation is sufficiently small that
use of the write cache unit will be likely to improve
performance.
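
The two multi-block strategies described above can be
compared with a short sketch that counts block reads
and writes. The CountingStripe class and both function
names are hypothetical; the stripe is simulated in
memory with the parity block stored last.

```python
from functools import reduce
from typing import Dict, List


def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class CountingStripe:
    """In-memory stand-in for one stripe; the last block holds parity."""

    def __init__(self, data_blocks: List[bytes]) -> None:
        self.blocks = list(data_blocks) + [reduce(xor_blocks, data_blocks)]
        self.parity_idx = len(data_blocks)
        self.reads = 0
        self.writes = 0

    def read(self, idx: int) -> bytes:
        self.reads += 1
        return self.blocks[idx]

    def write(self, idx: int, block: bytes) -> None:
        self.writes += 1
        self.blocks[idx] = block

    def parity_is_consistent(self) -> bool:
        data = self.blocks[:self.parity_idx]
        return reduce(xor_blocks, data) == self.blocks[self.parity_idx]


def multi_block_write(stripe: CountingStripe, updates: Dict[int, bytes]) -> None:
    """Read-modify-write each updated block, then update parity once."""
    mask = bytes(len(next(iter(updates.values()))))   # all-zero change mask
    for idx, new in updates.items():
        old = stripe.read(idx)
        mask = xor_blocks(mask, xor_blocks(old, new))  # fold in changed bits
        stripe.write(idx, new)
    parity = stripe.read(stripe.parity_idx)
    stripe.write(stripe.parity_idx, xor_blocks(parity, mask))


def full_stripe_write(stripe: CountingStripe, updates: Dict[int, bytes]) -> None:
    """Read only unchanged blocks, then write updates, accumulating parity."""
    parity = bytes(len(next(iter(updates.values()))))
    for idx in range(stripe.parity_idx):
        if idx not in updates:
            parity = xor_blocks(parity, stripe.read(idx))
    for idx, new in updates.items():
        stripe.write(idx, new)
        parity = xor_blocks(parity, new)
    stripe.write(stripe.parity_idx, parity)
```

With two updated blocks, multi_block_write performs
three reads and three writes, matching the count given
above. With three of four data blocks updated,
full_stripe_write needs one read and four writes.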

In order to maintain data redundancy at all times, 
the information written to write assist unit 104 in- 
cludes status information necessary to reconstruct 
data in the event the contents of dynamic memory 
203 are lost. Therefore, for each write of data to the 
assist unit, the controller builds a header/trailer con- 
taining this status information as indicated at step 
322. A high-level diagram of the structure of a data
record written to assist unit 104 is shown in Fig. 6. A
typical data record 601 comprises a header block 602,
a variable number of data blocks 603-605, followed by
a trailer block 606, and one or more blocks of a
performance gap 607.

Header and trailer blocks 602, 606 contain only
status and other information needed to reconstruct
data. The data itself which is written to the service
units 105-108 is contained entirely within data blocks
603-605. Trailer block 606 is a verbatim copy of the
first header block 602. The purpose of inserting trailer
block 606 is to verify during data reconstruction that
all data blocks were in fact written to the write assist
unit 104.

Performance gap 607 is a predefined number of
blocks containing undefined data. The purpose of gap
607 is to allow the controller sufficient time to process
the next WRITE command where multiple commands
are on the WAD queue. While the controller is
processing the next WRITE command on the queue (i.e.,
building header/trailer, checking status) the write
assist disk unit continues to rotate a small angular
distance past the end of the record. If the next record
were to be started at the immediately succeeding block
location, the controller would have to wait for a full
disk revolution to complete before the next write
operation could begin. In order to avoid this,
performance gap 607, which contains unused data, is
inserted at the end of a record. By the time the disk
rotates past the block(s) comprising performance gap
607, the controller will be ready for the next WRITE
operation. While one performance gap block 607 is
depicted in Fig. 6, it should be understood that the
actual number of such blocks may vary depending on
the characteristics of the disk unit.
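
One way to size the performance gap is sketched below.
The specification gives no formula; the function
gap_blocks and its parameters (controller processing
time, rotation period, blocks per track) are
assumptions made only for illustration.

```python
def gap_blocks(processing_us: int, rotation_us: int, blocks_per_track: int) -> int:
    """Whole blocks passing under the head while the controller prepares
    the next WRITE (all three inputs are hypothetical drive figures)."""
    if min(processing_us, rotation_us, blocks_per_track) <= 0:
        raise ValueError("all arguments must be positive")
    # ceil(processing_us / (rotation_us / blocks_per_track)), integer-only
    return -(-processing_us * blocks_per_track // rotation_us)
```

For a hypothetical drive with a 16,000 microsecond
rotation and 64 blocks per track, each block takes 250
microseconds to pass, so a 600 microsecond processing
time calls for a gap of three blocks.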

In addition to data record 601, the controller will
under certain circumstances write an update record to
write assist unit 104. An update record comprises only
the header block(s). The update record is appended
to the end of a chain of data records 601 when no
further WRITE operations are on the WAD queue
awaiting writing to the assist unit 104. In this case, the
update record is eventually overwritten with another
update record (if there are status changes in the
uncommitted list) or a data record which is added to
the existing chain. The update record is also appended
to a chain of data records 601 at the end of a disk
sweep (i.e., the disk arm has swept across the entire
disk surface, and must return to the starting point of its
sweep to write the next record). Because data records
are never split between the end and beginning of a
sweep, an update record pointing to the start of a
sweep will be inserted at the end of a chain whenever
the disk space remaining in the sweep is insufficient
to store the next data record.

The structure of a header or trailer block is shown
in Fig. 7. The block contains command identifier 701,
command address 702, number of status blocks 703,
next command address 704, number of entries in
uncommitted list 705, uncommitted list entries 706, 707,
padding 708, SCSI command 709 and command
extension 710.
Command identifier 701 is a unique 4-byte
identifier generated by controller 103 and associated
with the write record 601. The controller increments the
identifier by 1 each time it writes a new record to write
assist unit 104; the identifier wraps to 0 after reaching
X'FFFFFFFF'. When traversing a chain of commands
stored on the assist unit as part of data reconstruction
(as described below), the identifier is used to verify
that the next record is indeed part of the chain.
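
The identifier sequence, including the wrap from
X'FFFFFFFF' to 0, can be sketched as follows. The
function names are hypothetical.

```python
ID_MODULUS = 1 << 32  # command identifier 701 is a 4-byte value


def next_command_id(current: int) -> int:
    """Identifier assigned to the record written after `current`."""
    return (current + 1) % ID_MODULUS


def is_next_in_chain(current_id: int, candidate_id: int) -> bool:
    """True if `candidate_id` directly follows `current_id` in a chain."""
    return candidate_id == next_command_id(current_id)
```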

Command address 702 contains the address on 
the write assist unit at which the record begins. Num- 
ber of status blocks 703 contains the number of 
blocks in the header record. In the preferred embodi- 
ment, this number is typically 1 (each block contain- 
ing 520 bytes of data). However, if the uncommitted 
list is unusually long, the header could occupy more
than one block. The trailer, on the other hand, repeats 
only the first block of the header, even where the 
header comprises multiple blocks. 

Next command address 704 contains the ad- 
dress on the write assist unit at which the next record 
in the chain is stored. In the case of a data record, this 
is the address of the block immediately after
performance gap 607 (which is the start of either an update
record or the next data record). In the case of an up- 
date record which was appended to the last data re- 
cord in a chain, the next command address is the 
starting address of the update record itself (i.e., the 
update record points to itself as the next block, signal- 
ling the end of the chain). If the update record was 
generated because the record was the last record in 
a disk arm sweep, the next address in the header 
block points to the beginning address of the write as- 
sist disk. When the write assist disk is initially format- 
ted, an empty update record containing only a header 
block is inserted at the beginning address; in this 
case, the next command address of this header block 
points to itself. Thus, in traversing a chain of records 
during data reconstruction, the controller will follow 
each pointer in next command address 704 until it
encounters one which points to itself.

Number of entries 705 contains the number of en- 
tries in the uncommitted list which follows. Each entry 
706,707 in the uncommitted list is an address on the 
write assist unit of a header block for a record which 
has not yet been written to the service units, as de- 
scribed above. The uncommitted list in the head- 
er/trailer block is a copy of the uncommitted list 212 
in dynamic RAM as it existed at the time the
header/trailer was generated. Once written, the
uncommitted list in a data record is not updated to reflect the
current state of the uncommitted list 212 in dynamic 
RAM. Instead, a more recent uncommitted list will be 
recorded in a subsequently written header of a data 
or update record. Although two entries 706,707 are 
shown in Fig. 7, the actual number of entries is vari- 
able. 

SCSI command 709 and command extension 710
are stored at a fixed location relative to the end of the
header/trailer block. Padding 708 contains unused
data of variable length required to fill the block to the
beginning of SCSI command 709. SCSI command 709
contains the write command issued to the service
units 105-108, which in the preferred embodiment
employ a Small Computer Systems Interface (SCSI)
protocol for communication with the controller 103.
Among other things, SCSI command 709 contains the
length of the data to be written, which data follows the
header block. Command extension 710 may contain
additional command parameters not part of the SCSI
command. In the preferred embodiment, command
extension 710 is used for a bit-mapped skip mask,
enabling selected data blocks in the record to be
written while others are skipped.
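
The layout of Fig. 7 can be sketched as a packed
520-byte block. Only the 4-byte width of command
identifier 701 and the 520-byte block size come from
the text. The 4-byte big-endian widths of the other
fields, the 10-byte SCSI command and the 4-byte
command extension are assumptions for illustration, as
are the function names. The sketch handles one-block
headers only.

```python
import struct
from typing import Dict, List

BLOCK_SIZE = 520     # bytes per block in the preferred embodiment
SCSI_CMD_LEN = 10    # assumed length of SCSI command 709
EXT_LEN = 4          # assumed length of command extension 710
FIXED = struct.Struct(">IIIII")  # fields 701-705 (widths of 702-705 assumed)


def build_header(command_id: int, command_address: int, next_address: int,
                 uncommitted: List[int], scsi_command: bytes,
                 extension: bytes) -> bytes:
    """Pack one header/trailer block; 709 and 710 sit at the block's end."""
    if len(scsi_command) != SCSI_CMD_LEN or len(extension) != EXT_LEN:
        raise ValueError("unexpected command or extension length")
    front = FIXED.pack(command_id, command_address, 1, next_address,
                       len(uncommitted))
    front += struct.pack(">%dI" % len(uncommitted), *uncommitted)  # 706, 707, ...
    pad_len = BLOCK_SIZE - len(front) - SCSI_CMD_LEN - EXT_LEN     # padding 708
    if pad_len < 0:
        raise ValueError("uncommitted list too long for a one-block header")
    return front + bytes(pad_len) + scsi_command + extension


def parse_header(block: bytes) -> Dict[str, object]:
    """Unpack a block produced by build_header."""
    cid, addr, n_status, nxt, n_entries = FIXED.unpack_from(block, 0)
    entries = list(struct.unpack_from(">%dI" % n_entries, block, FIXED.size))
    return {
        "command_id": cid,
        "command_address": addr,
        "status_blocks": n_status,
        "next_address": nxt,
        "uncommitted": entries,
        "scsi_command": block[-(SCSI_CMD_LEN + EXT_LEN):-EXT_LEN],
        "extension": block[-EXT_LEN:],
    }
```

Because the command fields are located from the end of
the block, a reader can find them without first walking
the variable-length uncommitted list.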

The storage subsystem of the present invention
is designed to preserve data in the event of failure of
any single disk unit or loss of contents of the array
controller dynamic memory 204. In the former event,
the subsystem can dynamically recover and continue
operation. The latter event is generally indicative of a
loss of system power or such other catastrophic event
in which the system as a whole is affected. In this
case, the integrity of data on the storage units is
preserved, although the controller will not necessarily be
able to continue operation until the condition causing
the failure is corrected.

From the perspective of array controller 103,
each storage unit 104-108 is a self-contained unit
which is either functioning properly or is not. The
storage unit itself may contain internal diagnostic and
error recovery mechanisms which enable it to
overcome certain types of internal defects. Such
mechanisms are beyond the scope of the present invention.
As used herein, the failure of a storage unit means
failure to function, i.e., to access data. Such a failure
may be, but is not necessarily, caused by a
breakdown of the unit itself. For example, the unit could be
powered off, or a data cable may be disconnected.
From the perspective of the controller, any such
failure, whatever the cause, is a failure of the storage
unit. Detection mechanisms which detect such
failures are known in the art.

In the event of failure of write assist unit 104,
array controller 103 updates its status information in
non-volatile RAM to reflect that the assist unit is no
longer in service, and thereafter continues operation
of the service units as before, without using the write
assist unit.

Figs. 8 and 9 represent the steps taken by array
controller 103 in the event a failure of one of the
service units 105-108 is detected. Fig. 8 is a high-level
flow diagram of the overall recovery process. The
controller first deactivates the write assist function so
that no more WRITE commands are written to the
write assist unit at step 801. The controller then
completes the writing of any incomplete WRITE
operations in its uncommitted list 212 to the service units,
including the updating of parity, at step 802. The
controller then dynamically reassigns storage space
previously allocated to the failed service unit to the write
assist unit at step 803. Data on the failed service unit
is then reconstructed by Exclusive-ORing the data at
the same location on the remaining service units, and
saved on the unit formerly allocated as the write
assist unit, at step 804. There may be some overlap of
steps 802-804. The subsystem then continues
normal function without write assist, with the write assist
unit 104 performing the function of the failed service
unit, at step 805.
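
The reconstruction at step 804 can be sketched as
follows, assuming the blocks at one location on every
surviving service unit (data and parity alike) are
available as equal-length byte strings. The function
names are hypothetical.

```python
from functools import reduce
from typing import Iterable


def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def reconstruct_block(surviving_blocks: Iterable[bytes]) -> bytes:
    """Recover the failed unit's block from all surviving blocks of a stripe."""
    return reduce(xor_blocks, surviving_blocks)
```

The same function recovers a lost data block or a lost
parity block, since the Exclusive-OR of all blocks in a
consistent stripe is zero.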

Fig. 9 illustrates the steps required to complete
any incomplete WRITE operations, which are
represented in Fig. 8 by the single block 802. There are
several possible cases, each of which requires
individual consideration. If the incomplete write operation
does not require any further access to the failed
service unit (step 901), then the write operation proceeds
normally at step 904. This would be the case either
where the write operation never required access to
the failed unit, or where the failed unit had already
been accessed prior to its failure. If access is
required, but no read access is required (i.e., only write
access is required, step 902), then the controller
simply omits the write to the failed disk unit, and
otherwise continues the write operation normally as if the
failed unit had been written to at step 905. This would
be the case, for example, where steps 402, 403 of Fig.
4 had been completed prior to the disk unit failure, but
where step 405 had not. It could also occur, for
example, where a write operation involves all or nearly all
of the blocks on a single stripe, and instead of reading
each block before writing to produce a change mask
as shown in Fig. 4, the blocks are either read only or
written to only, and a change mask accumulated with
each read or write, as described above.

If read access to the failed unit is required but
write access is not (step 903), then the incomplete
write operation is a multi-block write operation
updating most of the blocks in the stripe, but not affecting
the block on the failed unit. Because unaffected
blocks are read before affected blocks are written to,
none of the affected blocks has yet been altered. In
this case it is possible to complete the incomplete
write operation by reading each block to be updated
before writing to it and accumulating a change mask,
using the procedure of Fig. 4, at step 906.

The final case is where both read and write
access to the failed unit is required (the "yes" branch
from block 903). In this case the blocks on the same
stripe in all remaining service units (other than the
unit containing the parity block) are either read (if not
requiring updating) at step 907 or written to at step
908, and the data from each respective read or write
successively Exclusive-ORed to accumulate parity.
This partial parity is Exclusive-ORed with the data to
be written to the failed unit to obtain the new parity at
step 909, which is then written to the parity block at
step 910.
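
The four cases of Fig. 9, and the parity computation of
the final case, can be sketched as below. The function
names, the string results and the dictionary
representation of the stripe are hypothetical.

```python
from typing import Dict


def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def recovery_case(needs_read: bool, needs_write: bool) -> str:
    """Select the Fig. 9 branch from the access still owed to the failed unit."""
    if not needs_read and not needs_write:
        return "proceed normally (step 904)"
    if needs_write and not needs_read:
        return "omit the write to the failed unit (step 905)"
    if needs_read and not needs_write:
        return "redo using the Fig. 4 procedure (step 906)"
    return "rebuild parity from surviving units (steps 907-910)"


def rebuild_parity(surviving: Dict[int, bytes], updates: Dict[int, bytes],
                   failed_idx: int) -> bytes:
    """Final case. `surviving` maps unit index to the data block on each
    surviving data unit (parity unit excluded) and is updated in place."""
    parity = updates[failed_idx]            # data destined for the failed unit
    for idx in surviving:
        if idx in updates:
            surviving[idx] = updates[idx]   # step 908: write the new data
        parity = xor_blocks(parity, surviving[idx])  # accumulate (907/908)
    return parity                           # steps 909-910: new parity block
```

The returned parity already reflects the data that
could not be written to the failed unit, so that data
can later be reconstructed from the surviving blocks.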

It will be appreciated that the array controller may 
have completed some of the steps explained above 
for a write operation at the time a disk unit fails, and 
in that case it would be unnecessary to repeat such 
steps because the product (change mask, read data, 
etc.) would be in the controller's dynamic memory 
203. 

After the incomplete write operations have been 
completed as described above, the write assist unit 
can assume the function of the failed service unit.
The controller will update its status information to re- 
flect that the failed unit is no longer serviceable and 
the write assist unit is now the repository of the data 
formerly contained on the failed service unit. Data on 
the failed service unit can either be reconstructed at 
once, or can be reconstructed in blocks on demand. 
Such dynamic reconstruction techniques are descri- 
bed in US Patent Application Serial no. 07/542,216, 
filed June 21, 1990, herein incorporated by reference. 

In the event of loss of the contents of controller 
memory, the data to be written, as well as the list of 
incomplete write operations, will be contained in the
write assist unit 104. After controller operation is re- 
stored, the controller locates the most recent uncom- 
mitted list on the write assist unit, loads this list into 
its dynamic memory, and performs each write operation
on the list to make the storage subsystem current.
Because the most recent uncommitted list on
the write assist unit is not necessarily updated each 
time a write operation completes, it is possible that 
some write operations on the uncommitted list will 
have already completed. However, rewriting this data 
will not affect data integrity. 

Fig. 10 shows the steps required to obtain the 
most recent uncommitted list from write assist disk 
unit 104. The controller first checks the status record 
211 in non-volatile RAM 204 for the address of a re- 
cent WAD record. If the contents of non-volatile RAM 
204 have been lost (step 1001), the current record is 
initialized to a block at a predefined location at the 
start of a disk sweep, at step 1002. The block at this 
location is always a header block, and will be either 
the header for a data record, the header for an update 
record at the end of a chain of data records, or the 
header of the initial record placed on the disk when 
formatted. If the contents of non-volatile RAM 204 are 
intact (step 1001), the current record is initialized to 
the record pointed to by the address value saved in 
non-volatile RAM. Since this value is periodically up- 
dated by the controller during actual operation, it is 
generally closer to the end of the chain of WAD re- 
cords than a record at the first address on the write 
assist unit. However, the chain of records on the as- 
sist unit can be traversed in either case. The controller 
reads the header of this first record. 
If the command length specified in field 709 of
header block 602 is 0 (indicating it is not a data record)
(step 1004), then the header at the predefined
location contains the most current uncommitted list, and
this uncommitted list is loaded into the controller's
dynamic memory 203 at step 1012. If the command
length in step 1004 is not 0, the header is part of a
data record. The controller then reads the trailer block
of the data record, which is located at the offset from
the header specified by the command length, at step
1005. The controller then compares the trailer to the
header at step 1006. If the blocks are not identical,
then the writing of data was interrupted between the
time that the header and trailer were written. In this
case, the current data record is taken as the end of
the chain, and the uncommitted list in the header is
the most recent uncommitted list available. The
controller loads this list into dynamic memory at step
1012 and ends. If the trailer agrees with the header,
the controller reads the header of the next record on
the write assist unit at step 1007. This record is
located at the address specified in next address field 704
of the header for the current record. If the command
ID specified in field 701 of the header for the next
record is not one more than that of the current record
(step 1008), the sequence of records has been
interrupted, and the uncommitted list from the current
record is saved as the most recent uncommitted list at
step 1012. If the command ID in the header of the
next record is exactly one more than that of the
current record (step 1008), then the next record is indeed
part of the same chain. In this case, the next record
becomes the "current" record at step 1009. The
controller then checks the current record header to
determine whether the current record is another data
record or an update record at step 1010. If it is an update
record (indicated by the next record address field 704
being the same as the command address field 702,
i.e., the record points to itself), the end of the chain
has been reached, and the uncommitted list from the
current record header is loaded in memory at step
1012. If the current record is another data record at
step 1010, the program loops to step 1005, and
repeats steps 1005-1010 until a termination condition is
encountered.
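
The traversal of Fig. 10 can be sketched as follows.
The write assist unit is modelled as a dictionary from
block address to record; the Header and Record classes
and the function name are hypothetical. As an
assumption of this sketch, a record with command
length 0 that does not point to itself (an end-of-sweep
update record) is followed without a trailer check.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Header:
    command_id: int        # field 701
    address: int           # field 702
    next_address: int      # field 704
    command_length: int    # from field 709; 0 means not a data record
    uncommitted: List[int]


@dataclass
class Record:
    header: Header
    trailer: Optional[Header] = None  # None models a trailer never written


def latest_uncommitted_list(disk: Dict[int, Record], start: int) -> List[int]:
    """Follow the record chain from `start` and return the newest list."""
    current = disk[start]
    if current.header.command_length == 0:               # step 1004
        return current.header.uncommitted                # step 1012
    while True:
        is_data = current.header.command_length != 0
        if is_data and current.trailer != current.header:  # steps 1005-1006
            return current.header.uncommitted
        nxt = disk.get(current.header.next_address)      # step 1007
        expected = (current.header.command_id + 1) % (1 << 32)
        if nxt is None or nxt.header.command_id != expected:  # step 1008
            return current.header.uncommitted
        current = nxt                                    # step 1009
        if current.header.next_address == current.header.address:  # step 1010
            return current.header.uncommitted
```

Each exit corresponds to one of the termination
conditions above: a start record that is not a data
record, a trailer mismatch, a break in the identifier
sequence, or an update record that points to itself.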

Fig. 11 shows the steps required to complete all
incomplete WRITE operations identified on the
uncommitted list, once the uncommitted list has been
recovered using the procedure shown in Fig. 10.
Because a WRITE operation on the uncommitted list
may have been interrupted at any point, it must be
assumed that parity blocks in the same stripe as data
blocks to be written may contain erroneous parity.
Accordingly, the procedure illustrated in Fig. 4 cannot
be employed to complete the WRITE operations. For
each write operation on the uncommitted list, the
controller first retrieves the data to be written from the
write assist unit 104, and stores it in dynamic memory
203, at step 1101. The controller then reads all data
blocks on the stripe to be written to which do not
require updating, and accumulates a new partial parity
by Exclusive-ORing each successively read data
block, at step 1102. The controller then writes the
data blocks to be written to the respective service
units, and successively Exclusive-ORs each written
block with the partial parity to obtain the new parity,
at step 1103. It should be noted that steps 1102 and
1103 may involve no blocks read and all data blocks
in the stripe written to, or may involve all data blocks
but one read and only one written to, or any
intermediate combination. The final step is to write the new
parity to the parity block at step 1104. Steps 1102-1104
are repeated until all write operations on the
uncommitted list are completed (step 1105). An update
record containing an empty uncommitted list is then
written to the end of the record chain on the write
assist unit at step 1106.
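
Steps 1102-1104 can be sketched for a single stripe as
follows. The stripe is a list of equal-length byte
strings and the function names are hypothetical. The
old parity is never read, since it cannot be trusted.

```python
from typing import Dict, List


def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def redo_write(stripe: List[bytes], parity_idx: int,
               updates: Dict[int, bytes]) -> None:
    """Complete one uncommitted write, rebuilding parity from scratch."""
    parity = bytes(len(stripe[parity_idx]))      # start from all zeros
    for idx, block in enumerate(stripe):         # step 1102: unchanged blocks
        if idx != parity_idx and idx not in updates:
            parity = xor_blocks(parity, block)
    for idx, new in updates.items():             # step 1103: write new data
        stripe[idx] = new
        parity = xor_blocks(parity, new)
    stripe[parity_idx] = parity                  # step 1104: write new parity
```

Running redo_write a second time with the same updates
leaves the stripe unchanged, which is why repeating an
operation that had in fact already completed does not
affect data integrity.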

In the preferred embodiment, a single array
controller services a plurality of disk drives in a storage
subsystem. The disk drives themselves are
redundant, enabling the subsystem to continue operation in
the event of failure of a single drive, but the controller
is not. Alternatively, it would be possible to operate
the storage subsystem with multiple redundant
controllers, enabling the system to remain operational in
the event of failure of any single controller. Because
the write assist unit maintains data redundancy, it
would not be necessary for the multiple controllers to
contain redundant uncommitted lists, command
queues, and other data. For example, assuming
proper physical connections exist, it would be possible to
operate a subsystem having controllers A and B, in
which controller A services disk drives 1 to N, and B
services disk drives (N+1) to 2N. In the event of failure
of any one controller, the other would service all disk
drives 1 to 2N, using the information in the write assist
unit to recover incomplete write operations. In this
case, the subsystem would continue to operate
despite the failure of a single controller, although its
performance may be degraded.

In the preferred embodiment, a single write assist 
unit is associated with a single parity group of service 
units (i.e., a group of service units which share parity).
However, it would alternatively be possible to operate 
a storage subsystem according to the present inven- 
tion with multiple write assist units. Additionally, it 
would be possible to operate a subsystem having 
multiple parity groups, in which one or more write as- 
sist units are shared among the various parity groups 
of service units. 

In the preferred embodiment, the service units
are organized as a RAID level 5. Each stripe of stor- 
age blocks in the service units comprises a plurality 
of data blocks and a single parity block (data redun- 
dancy block). Multiple stripes exist, in which the parity 
blocks are distributed among different service units. 

EP 0 569 313 A2 

The use of a single parity block provides the simplest 
form of data redundancy, and it is believed that
distributing the parity blocks provides the best
performance. However, in the alternative it would be
possible to practice the present invention using other types
of storage unit arrays. For example, there could be 
but a single stripe of blocks, or all parity blocks could 
be on a single service unit, as in the case of a RAID- 
3 or RAID-4. Rather than a single parity block, it 
would be possible to practice this invention using 
more complex error correcting or detecting codes or 
multi-dimensional parity stored on multiple data
redundancy blocks, as in the case of a RAID-2.
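
The distribution of parity blocks among the service
units in the preferred embodiment can be illustrated
with a simple rotation. The specification does not fix
the rotation rule; taking the stripe number modulo the
number of service units is an assumption of this
sketch, and the function names are hypothetical.

```python
from typing import List, Tuple


def parity_unit(stripe: int, n_units: int) -> int:
    """Index of the service unit holding parity for `stripe` (assumed rule)."""
    return stripe % n_units


def stripe_layout(stripe: int, n_units: int) -> Tuple[int, List[int]]:
    """Return (parity unit, data units) for one stripe."""
    p = parity_unit(stripe, n_units)
    return p, [u for u in range(n_units) if u != p]
```

Placing all parity on one unit, as in a RAID-3 or
RAID-4, would correspond to parity_unit returning a
constant.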

In the preferred embodiment, all storage units 
have the same storage capacity. This simplifies the 
control mechanism and facilitates substitution of one 
unit for another. However, it would alternatively be 
possible to practice the present invention with units of
varying capacities. In particular, the write assist unit 
might be larger than the service units, enabling it to 
maintain write assist function even if it is also used to
store data reconstructed from a failed storage device. 

In the preferred embodiment, the write assist unit 
is used as a sequentially written log of the incomplete 
write operations. However, it may alternatively be
possible to use the write assist unit in other ways. For
example, data would not have to be sequentially writ- 
ten to the assist unit, and could be random access. 
The assist unit could be used for other purposes, such 
as a read cache. The assist unit might be used in an 
assist mode for any function which would improve 
performance and/or redundancy, while simultane- 
ously having the capability to switch to a service unit 
operating mode, thereby doubling as a spare unit. 

In the preferred embodiment, the storage units 
are rotating magnetic disk drive storage units. Such 
units are standard in the industry at the present time.
However, it would be possible to operate a storage
subsystem according to the present invention having
storage units employing a different technology. For 
example, optical disk storage units may be employed. 

Although a specific embodiment of the invention 
has been disclosed along with certain alternatives, it 
will be recognized by those skilled in the art that
additional variations in form and detail may be made
within the scope of the following claims. 



Claims 

1. A storage subsystem for a computer system,
comprising:

a storage subsystem controller, said controller
having a processor and a memory;
at least four data storage units coupled to said
controller, wherein at least one of said data storage
units is a write assist data storage unit, and at least
three of said data storage units are service data
storage units;
at least one stripe of storage blocks, each stripe
comprising a plurality of data storage blocks for
containing data and at least one data redundancy
storage block for containing data redundant of the
data stored in said data storage blocks, each of said
storage blocks being contained on a respective
service data storage unit;
means in said controller for maintaining said data
redundancy storage block on said stripe of storage
blocks;
means in said controller for receiving data to be
stored on said data storage units;
means for writing said data to be stored to said write
assist unit;
means in said controller for signalling operation
complete after writing said data to said write assist
unit and before writing said data to any of said service
data storage units;
means for reconstructing said data in the event any
one of said data storage units fails after signalling
operation complete; and
means for reconstructing said data in the event the
contents of said memory are lost after signalling
operation complete.

2. The storage subsystem of claim 1, further
comprising:

means for storing data reconstructed from a failing
service data storage unit on said write assist unit; and
means for operating said write assist unit as said
failing service unit after said data reconstructed from
said failing service unit has been stored on said write
assist unit.

3. The storage subsystem of claim 1, wherein said
data redundancy storage block comprises:

a parity storage block for containing the parity of
data stored in said data storage blocks,
at least two of said stripes of storage blocks,
wherein said parity storage blocks are distributed
among said service data storage units in a round
robin manner.



4. A storage apparatus for a computer system,
comprising:

a write assist data storage unit;
a plurality of service data storage units;
means for maintaining data redundancy among said
plurality of service data storage units;
means for temporarily storing data to be written to
said service data storage units in said write assist
unit;
means for reconstructing data stored on a service
data storage unit in the event of failure of said unit;
and
means for storing said reconstructed data on said
write assist unit.

5. The storage apparatus of claim 4, wherein said
means for maintaining data redundancy comprises:

at least one stripe of storage blocks, each stripe
comprising a plurality of data storage blocks for
containing data and one parity storage block for
containing parity of the data stored in said data
storage blocks, each of said storage blocks being
contained on a respective service data storage
unit;
means for determining the parity of said plurality
of data storage blocks; and
means for storing said parity of said plurality of
data storage blocks in said parity storage block.

6. The storage apparatus of claim 4, further
comprising:

means for disabling the write assist function of
said write assist unit in the event of failure of a
service data storage unit; and
means for operating said write assist unit as said
service unit which failed.

7. The storage apparatus of claim 4, wherein said
means for maintaining data redundancy comprises:

at least one stripe of storage blocks, each stripe
comprising a plurality of data storage blocks for
containing data and one parity storage block for
containing parity of the data stored in said data
storage blocks, each of said storage blocks being
contained on a respective service data storage
unit;
means for determining the parity of said plurality
of data storage blocks; and
means for storing said parity of said plurality of
data storage blocks in said parity storage block.

8. The storage apparatus of claim 8, further
comprising:

selection means for selectively determining
whether said data to be written to said service
units should be temporarily stored in said write
assist unit;
wherein said means for temporarily storing data
to be written to said service units in said write
assist unit selectively writes data to said write assist
unit in response to said determination made by
said selection means.



A method for storing data in a computer system, 
comprising the steps of: 

storing data redundantly on a plurality of service 
data storage units; 

writing updated data to be written to said plurality 
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of service data storage units to a write assist data 
storage unit; 

signalling that said updated data has been written 
to said plurality of service data storage units; 
writing said updated data redundantly to said 
plurality of service data storage units, wherein 
said step of writing said updated data to said plur- 
ality of service data storage units is completed af- 
ter said signalling step; 

reconstructing data stored in a service data stor- 
age unit in the event of failure of said service data 
storage unit; and 

storing said reconstructed data on said write as- 
sist unit, and thereafter operating said write as- 
sist unit as said service unit which failed, in the 
event of said failure of said service data storage 
unit.

10. The method of claim 9,

wherein said step of storing data redundantly on 
a plurality of service data storage units comprises 
storing data on at least one stripe of storage 
blocks, each stripe comprising a plurality of data 
storage blocks for containing data and one parity 
storage block for containing parity of the data 
stored in said data storage blocks, each of said 
storage blocks being contained on a respective 
service data storage unit; and 
wherein said step of writing said updated data re- 
dundantly to said plurality of service data storage 
units comprises updating said parity storage 
block of a stripe of storage blocks being updated. 
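
The ordering of steps in the method above (write to the assist unit, signal completion, then update the service units and parity) can be sketched as follows. This is an illustrative model only, assuming a single stripe, XOR parity, and an in-memory list standing in for the sequential log on the write assist unit; the class `WriteAssistArray` and its method names are hypothetical and do not come from the claims.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


class WriteAssistArray:
    """Toy model: one stripe of data blocks, one parity block, one assist log."""

    def __init__(self, n_data_units, block_size):
        self.data = [bytes(block_size) for _ in range(n_data_units)]
        self.parity = bytes(block_size)  # XOR of all-zero blocks is zero
        self.assist_log = []             # stands in for the write assist unit

    def host_write(self, unit, new_data):
        # Step 1: append the update to the assist unit without reading
        # any old data or parity first.
        self.assist_log.append((unit, bytes(new_data)))
        # Step 2: signal completion before the service units are touched.
        return True

    def destage(self):
        # Step 3 (asynchronous in a real controller): apply each logged
        # update to its service unit and bring the parity block up to date.
        while self.assist_log:
            unit, new_data = self.assist_log[0]
            old_data = self.data[unit]
            # new parity = old parity XOR old data XOR new data
            self.parity = xor_bytes(self.parity, xor_bytes(old_data, new_data))
            self.data[unit] = new_data
            self.assist_log.pop(0)  # drop the entry only once it is applied


array = WriteAssistArray(n_data_units=3, block_size=2)
acknowledged = array.host_write(1, b"\xaa\x55")  # returns before any parity work
array.destage()                                  # parity now matches the data
```

Keeping the log entry until the update has been applied mirrors the recovery idea in the claims: if the update is interrupted, the data is still available on the assist unit.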

11. A storage subsystem controller for a computer 
system, comprising: 

a processor; 
a memory; 

a host interface for communicating with a host 
computer system; 

a storage unit interface for communicating with at 
least four data storage units coupled to said con- 
troller, wherein at least one of said data storage 
units is a write assist data storage unit, and at 
least three of said data storage units are service
data storage units, 

wherein said service data storage units comprise 
at least one stripe of storage blocks, each stripe 
comprising a plurality of data storage blocks for 
containing data and at least one data redundancy 
storage block for containing data redundant of the 
data stored in said data storage blocks, each of
said storage blocks being contained on a
respective service data storage unit;
means for maintaining said data redundancy stor- 
age block on said stripe of storage blocks; 
means for receiving data to be stored on said data 
storage units from said host computer system; 
means for writing said data to be stored to said 
write assist unit;

means for signalling operation complete to said 
host computer system after writing said data to 
said write assist unit and before writing said data
to any of said service data storage units;
means for reconstructing said data in the event
any one of said data storage units fails after sig- 
nalling operation complete; and 

means for reconstructing said data in the event
the contents of said memory are lost after signal- 
ling operation complete. 

12. The storage subsystem controller of claim 11,
further comprising:
means for storing data reconstructed from a failing
service data storage unit on said write assist
unit; and

means for operating said write assist unit as said
failing service unit after said data reconstructed
from said failing service unit has been stored on
said write assist unit.
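
The reconstruction and takeover recited above can be sketched in a few lines. The sketch assumes XOR parity, so that a missing block equals the XOR of the surviving blocks of its stripe, and it models each storage unit as a Python list holding one block per stripe. The names `xor_blocks` and `rebuild_onto_assist` are hypothetical.

```python
def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def rebuild_onto_assist(service_units, failed_index, assist_unit):
    """Rebuild the failed unit's blocks onto the assist unit, then promote it.

    service_units: list of units, each a list with one block per stripe
                   (data and parity blocks are treated alike).
    assist_unit:   list with one slot per stripe; its old contents are
                   overwritten, so it stops acting as a write assist unit.
    """
    for stripe in range(len(assist_unit)):
        survivors = [unit[stripe]
                     for i, unit in enumerate(service_units)
                     if i != failed_index]
        assist_unit[stripe] = xor_blocks(survivors)
    # From here on the former assist unit serves as the failed service unit.
    service_units[failed_index] = assist_unit
    return service_units


# Hypothetical array: two data units and one parity unit, two stripes.
unit0 = [b"\x01\x02", b"\x10\x20"]
unit1 = [b"\x04\x08", b"\x40\x80"]
unit2 = [xor_blocks([unit0[0], unit1[0]]), xor_blocks([unit0[1], unit1[1]])]

units = [unit0, None, unit2]   # unit 1 has failed and is unreadable
assist = [b"", b""]
rebuild_onto_assist(units, failed_index=1, assist_unit=assist)
assert units[1] == unit1       # the lost contents are recovered
```

A real controller would first have to apply any updates still pending on the assist unit before overwriting it; that step is omitted here.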



13. The storage subsystem controller of claim 11,

wherein said data redundancy storage block
comprises a parity storage block for containing
the parity of data stored in said data storage
blocks.

14. A storage apparatus for a computer system,
comprising:

a plurality of service data storage units;
an additional data storage unit capable of
becoming a spare unit;

means for operating said additional unit in an
assist mode to assist the function of said service
data storage units;

means for maintaining data redundancy among
said plurality of service data storage units;
means for reconstructing data stored on a
service data storage unit in the event of failure of
said unit; and

means for storing said reconstructed data on said
additional unit capable of becoming a spare unit,
and thereafter operating said additional unit as
said service unit which failed.

15. The storage apparatus of claim 14, wherein said
means for maintaining data redundancy comprises:
at least one stripe of storage blocks, each stripe
comprising a plurality of data storage blocks for
containing data and one parity storage block for
containing parity of the data stored in said data
storage blocks, each of said storage blocks being
contained on a respective service data storage
unit;

means for determining the parity of said plurality
of data storage blocks; and

means for storing said parity of said plurality of
data storage blocks in said parity storage block.

16. The storage apparatus of claim 15,

wherein said plurality of service data storage
units contain at least two of said stripes of storage
blocks, and

wherein said parity storage blocks are distributed
among said service data storage units in a round
robin manner.